The maUEB package contains functions to analyze data from microarray experiments. This vignette shows an example of using the maUEB package to run a complete microarray analysis. The workflow includes the following steps:
The example is based on a study deposited in the Gene Expression Omnibus (GEO) and published in ref. The study is based on 12 samples hybridized on Clariom S Mouse arrays from Thermo Fisher. The experimental condition considered in this study is:
For illustration purposes, an imaginary Batch factor will be added to the sample data.
The following group of comparisons will be performed to determine the effects of each treatment on gene expression:
if (!require(maUEB)){
require(devtools)
install_github("uebvhir/maUEB", build_vignettes = FALSE)
}
## Loading required package: maUEB
##
## Registered S3 method overwritten by 'enrichplot':
## method from
## fortify.enrichResult DOSE
## Registered S3 method overwritten by 'GGally':
## method from
## +.gg ggplot2
##
## Registered S3 method overwritten by 'seriation':
## method from
## reorder.hclust gclus
## Registered S3 methods overwritten by 'ff':
## method from
## maxindex.default bit
## poslength.default bit
## maxindex.bitwhich bit
## maxindex.bit bit
## poslength.bitwhich bit
## poslength.bit bit
## Warning: replacing previous import 'limma::backgroundCorrect' by
## 'oligo::backgroundCorrect' when loading 'maUEB'
## Warning: replacing previous import 'heatmaply::normalize' by 'oligo::normalize'
## when loading 'maUEB'
## Warning: replacing previous import 'AnnotationDbi::select' by 'plotly::select'
## when loading 'maUEB'
## Warning: replacing previous import 'ggplot2::last_plot' by 'plotly::last_plot'
## when loading 'maUEB'
## Warning: replacing previous import 'AnnotationDbi::tail' by 'utils::tail' when
## loading 'maUEB'
## Warning: replacing previous import 'AnnotationDbi::head' by 'utils::head' when
## loading 'maUEB'
## Warning: replacing previous import 'ReportingTools::page' by 'utils::page' when
## loading 'maUEB'
# Set working directory (it points to the vignette directory)
workingDir <- here::here("vignettes")
# Parameters directory
paramDir <- file.path(workingDir, "parameters")
source(file.path(paramDir, "global_parameters.par"))
For this example study, global parameters were set as follows:
(global_parameters <- read.table(file.path(paramDir, "global_parameters.par"), header = FALSE, sep=";", as.is=TRUE))
## V1
## 1 dataDirN <- dades
## 2 celDirN <- celfiles
## 3 resultsDirN <- results
## 4 client <- RSRCHR
## 5 ID <- STUDY
## 6 descriptors <- c(FileName,Group, ShortName, Colors, Batch)
## 7 annotPackage <- clariomsmousetranscriptcluster.db
## 8 organismsp <- mouse
dataDir <- file.path(workingDir, dataDirN)
celDir <- file.path(workingDir, celDirN)
resultsDir <- file.path(workingDir, resultsDirN)
The build_targets() function can be used to build a
targets template dataframe that will then be filled in manually. The
template contains the array file names in rows
and the variables of interest in columns. The first column (FileName)
contains the names of the cel files found in the celfiles directory
specified in the parameters. The remaining columns correspond to the
descriptors specified in the global parameters, among which there must be
the following: Group, ShortName, and Colors. Other variables, such as Batch, PatientID,
or any other variable of interest, may be added by the user. The
targets template is saved as a semicolon-separated file named targets.RSRCHR.STUDY.template.csv, where RSRCHR refers to the client acronym and STUDY to the study UEB id.
An example of the targets generated for this study is shown below:
build_targets(inputDir=celDir, outputDir=dataDir, client=client, ID=ID, descriptors=descriptors)
(targets_template <- read.table(file.path(dataDir, paste0("targets.", client, ".", ID, ".template.csv")), header = TRUE, sep = ";", as.is=TRUE)) ## Note: the double quote is interpreted as a grave accent if it is not changed in R
## FileName Group ShortName Colors Batch
## 1 GSM3730337_A1-LT_1Wk.CEL 0 0 0 0
## 2 GSM3730338_A2-LT_1Wk.CEL 0 0 0 0
## 3 GSM3730339_A3-LT_1Wk.CEL 0 0 0 0
## 4 GSM3730340_A4-LT_1Wk.CEL 0 0 0 0
## 5 GSM3730341_B1-LT_1Wk.CEL 0 0 0 0
## 6 GSM3730342_B2-LT_1Wk.CEL 0 0 0 0
## 7 GSM3730343_B3-LT_1Wk.CEL 0 0 0 0
## 8 GSM3730344_B4-LT_1Wk.CEL 0 0 0 0
## 9 GSM3730345_E1-LT_1Wk.CEL 0 0 0 0
## 10 GSM3730346_E2-LT_1Wk.CEL 0 0 0 0
## 11 GSM3730347_E4-LT_1Wk.CEL 0 0 0 0
## 12 GSM3730348_E5-LT_1Wk.CEL 0 0 0 0
Then, the information for each descriptor needs to be filled in manually in Excel/LibreOffice and saved as a csv file named targets.RSRCHR.STUDY.csv, using ; as the field separator.
#Parameters
source(file.path(paramDir, "01-Load-QC-Norm-Filt.par"))
#Execution parameters
source(file.path(paramDir, "01-Load-QC-Norm-Filt_analysistodo.par"))
For this example study, parameters were set as follows:
(loadQCNormfilt_parameters <- read.table(file.path(paramDir, "01-Load-QC-Norm-Filt.par"), header = FALSE, sep=";", as.is=TRUE))
## V1
## 1 dataDirN <- dades
## 2 celDirN <- celfiles
## 3 resultsDirN <- results
## 4 resultsSummFN <- ResultsSummary_Load-QC-Norm-Filt.txt
## 5 targetsFN <- targets.RSRCHR.STUDY.csv
## 6 targets.fact <- c(Group, Batch)
## 7 batchcolName <- Batch
## 8 colorlist <- c(firebrick1, blue, darkgoldenrod1, darkorchid1, lightblue, orange4, seagreen, aquamarine3, red4, chartreuse2, plum, darkcyan, darkolivegreen3, forestgreen, gold,khaki1, lightgreen, gray67, deeppink)
## 9 annotPackage <- clariomsmousetranscriptcluster.db
## 10 samplestoremove <- CMP.2
## 11 normalize.method <- oligo::rma
## 12 filterByVar.fun <- IQR
## 13 filterByVar.thr <- 0.65
## 14 pcaRawData.fact <- targets.fact
## 15 pcaRawData.scale <- FALSE
## 16 pcaNormData.fact <- targets.fact
## 17 pcaNormData.scale <- FALSE
## 18 pct_threshold <- 0.6
## 19 pvcaNormData.fact <- targets.fact
## 20 targets.pvcaFN <- NULL
## 21 hclustRawData.method <- average
## 22 hclustNormData.method <- average
The execution parameters were set as follows:
(loadQCNormfilt_todoparameters <- read.table(file.path(paramDir, "01-Load-QC-Norm-Filt_analysistodo.par"), header = FALSE, sep=";", as.is=TRUE))
## V1
## 1 readcelFiles <- TRUE
## 2 loadRawData <- FALSE
## 3 loadNormData <- FALSE
## 4 normalizedata <- TRUE
## 5 QCraw <- TRUE
## 6 boxplotRawData <- QCraw
## 7 pcaRawData <- QCraw
## 8 heatmapRawData <- QCraw
## 9 arrayQMRawData <- QCraw
## 10 QCnorm <- TRUE
## 11 boxplotNormData <- QCnorm
## 12 pcaNormData <- QCnorm
## 13 pcaNormData.corrbatch <- TRUE
## 14 heatmapNormData <- QCnorm
## 15 arrayQMNormData <- QCnorm
## 16 pvcaNormData <- TRUE
## 17 SDplot <- TRUE
## 18 filterByAnnot <- TRUE
## 19 filterByVar <- FALSE
## 20 removingofsamples <- TRUE
## 21 save_annot_all <- TRUE
## 22 save_annot_filt <- TRUE
Function read_targets reads the targets file specified in targetsFN (here targets.RSRCHR.STUDY.csv) that was generated in section 3.1.5. The cel file names contained in column FileName are set as rownames and factor variables specified by parameter targets.fact are converted to factors.
(targets <- read_targets(inputDir=dataDir, targetsFN=targetsFN, targets.fact=targets.fact))
## FileName Group ShortName Colors Batch
## CTL.1 GSM3730337_A1-LT_1Wk.CEL CTL CTL.1 aquamarine3 1
## CTL.2 GSM3730338_A2-LT_1Wk.CEL CTL CTL.2 aquamarine3 2
## CTL.3 GSM3730339_A3-LT_1Wk.CEL CTL CTL.3 aquamarine3 1
## CTL.4 GSM3730340_A4-LT_1Wk.CEL CTL CTL.4 aquamarine3 2
## PD1.1 GSM3730341_B1-LT_1Wk.CEL PD1 PD1.1 firebrick 1
## PD1.2 GSM3730342_B2-LT_1Wk.CEL PD1 PD1.2 firebrick 2
## PD1.3 GSM3730343_B3-LT_1Wk.CEL PD1 PD1.3 firebrick 1
## PD1.4 GSM3730344_B4-LT_1Wk.CEL PD1 PD1.4 firebrick 2
## CMP.1 GSM3730345_E1-LT_1Wk.CEL CMP CMP.1 chartreuse3 1
## CMP.2 GSM3730346_E2-LT_1Wk.CEL CMP CMP.2 chartreuse3 2
## CMP.3 GSM3730347_E4-LT_1Wk.CEL CMP CMP.3 chartreuse3 1
## CMP.4 GSM3730348_E5-LT_1Wk.CEL CMP CMP.4 chartreuse3 2
The targets dataframe contains 12 rows and 5 columns. A summary of the variables is shown below:
summary(targets)
## FileName Group ShortName Colors Batch
## Length:12 CTL:4 Length:12 Length:12 1:6
## Class :character PD1:4 Class :character Class :character 2:6
## Mode :character CMP:4 Mode :character Mode :character
Raw data can be read from the cel files or directly loaded from an existing Rda object ("rawData.Rda"). Function read_celfiles() reads the cel files listed in targets rownames (and in the same order) that are contained in the celDir directory and returns an object of class ExpressionSet.
if (readcelFiles){
eset_raw <- read_celfiles(inputDir=celDir, targets=targets)
} else {if (loadRawData) {load(file.path(dataDir, "rawData.Rda"))}}
## Loading required package: pd.clariom.s.mouse.ht
## Loading required package: Biostrings
## Loading required package: BiocGenerics
## Loading required package: parallel
##
## Attaching package: 'BiocGenerics'
## The following objects are masked from 'package:parallel':
##
## clusterApply, clusterApplyLB, clusterCall, clusterEvalQ,
## clusterExport, clusterMap, parApply, parCapply, parLapply,
## parLapplyLB, parRapply, parSapply, parSapplyLB
## The following objects are masked from 'package:stats':
##
## IQR, mad, sd, var, xtabs
## The following objects are masked from 'package:base':
##
## anyDuplicated, append, as.data.frame, basename, cbind, colnames,
## dirname, do.call, duplicated, eval, evalq, Filter, Find, get, grep,
## grepl, intersect, is.unsorted, lapply, Map, mapply, match, mget,
## order, paste, pmax, pmax.int, pmin, pmin.int, Position, rank,
## rbind, Reduce, rownames, sapply, setdiff, sort, table, tapply,
## union, unique, unsplit, which, which.max, which.min
## Loading required package: S4Vectors
## Loading required package: stats4
##
## Attaching package: 'S4Vectors'
## The following object is masked from 'package:base':
##
## expand.grid
## Loading required package: IRanges
## Loading required package: XVector
##
## Attaching package: 'Biostrings'
## The following object is masked from 'package:base':
##
## strsplit
## Loading required package: RSQLite
## Loading required package: oligoClasses
## Welcome to oligoClasses version 1.46.0
## Loading required package: oligo
## Loading required package: Biobase
## Welcome to Bioconductor
##
## Vignettes contain introductory material; view with
## 'browseVignettes()'. To cite Bioconductor, see
## 'citation("Biobase")', and for packages 'citation("pkgname")'.
## ================================================================================
## Welcome to oligo version 1.48.0
## ================================================================================
## Loading required package: DBI
## Platform design info loaded.
## Reading in : /media/mferrer/Datos/Mireia/Scripts/Paquets_mireia/maUEB/vignettes/celfiles/GSM3730337_A1-LT_1Wk.CEL
## Reading in : /media/mferrer/Datos/Mireia/Scripts/Paquets_mireia/maUEB/vignettes/celfiles/GSM3730338_A2-LT_1Wk.CEL
## Reading in : /media/mferrer/Datos/Mireia/Scripts/Paquets_mireia/maUEB/vignettes/celfiles/GSM3730339_A3-LT_1Wk.CEL
## Reading in : /media/mferrer/Datos/Mireia/Scripts/Paquets_mireia/maUEB/vignettes/celfiles/GSM3730340_A4-LT_1Wk.CEL
## Reading in : /media/mferrer/Datos/Mireia/Scripts/Paquets_mireia/maUEB/vignettes/celfiles/GSM3730341_B1-LT_1Wk.CEL
## Reading in : /media/mferrer/Datos/Mireia/Scripts/Paquets_mireia/maUEB/vignettes/celfiles/GSM3730342_B2-LT_1Wk.CEL
## Reading in : /media/mferrer/Datos/Mireia/Scripts/Paquets_mireia/maUEB/vignettes/celfiles/GSM3730343_B3-LT_1Wk.CEL
## Reading in : /media/mferrer/Datos/Mireia/Scripts/Paquets_mireia/maUEB/vignettes/celfiles/GSM3730344_B4-LT_1Wk.CEL
## Reading in : /media/mferrer/Datos/Mireia/Scripts/Paquets_mireia/maUEB/vignettes/celfiles/GSM3730345_E1-LT_1Wk.CEL
## Reading in : /media/mferrer/Datos/Mireia/Scripts/Paquets_mireia/maUEB/vignettes/celfiles/GSM3730346_E2-LT_1Wk.CEL
## Reading in : /media/mferrer/Datos/Mireia/Scripts/Paquets_mireia/maUEB/vignettes/celfiles/GSM3730347_E4-LT_1Wk.CEL
## Reading in : /media/mferrer/Datos/Mireia/Scripts/Paquets_mireia/maUEB/vignettes/celfiles/GSM3730348_E5-LT_1Wk.CEL
The ExpressionSet object contains the following fields:
assayData: A matrix of expression values, where the rows represent probe sets (features)
and columns represent samples. Row and column names must be unique, and
consistent with row names of featureData and phenoData, respectively.
The assay data can be retrieved with exprs(). Note that it can be subsetted but not reassigned; to assign a new expression matrix, one must create a new ExpressionSet.
phenoData: An AnnotatedDataFrame containing
information about each sample as defined in the targets file. The number
of rows in phenoData must match the number of columns in assayData. Row
names of phenoData must match column names of the matrix in assayData.
The phenoData can be retrieved or assigned with pData().
annotation: A character describing the platform on
which the samples were assayed. This is often the name of a Bioconductor
chip annotation package. Can be retrieved or assigned with annotation(). For ClariomS/D it is automatically filled by the platform chip annotation.
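The consistency rules described above can be illustrated with plain base-R objects (the sample and probe names below are hypothetical):

```r
# Illustration (base R, hypothetical values) of the consistency rules an
# ExpressionSet enforces between assayData and phenoData.
samples  <- c("CTL.1", "CTL.2", "PD1.1", "PD1.2")
features <- paste0("probe_", 1:5)

# assayData: features in rows, samples in columns
expr <- matrix(rnorm(20), nrow = 5, dimnames = list(features, samples))

# phenoData: one row per sample, row names matching the expression columns
pheno <- data.frame(Group = c("CTL", "CTL", "PD1", "PD1"), row.names = samples)

# The invariant checked on construction: colnames(assayData) == rownames(phenoData)
stopifnot(identical(colnames(expr), rownames(pheno)))

# Subsetting the expression columns and the phenotype rows together
# preserves the invariant (this is what eset[, idx] does internally).
keep    <- samples != "PD1.2"
expr.f  <- expr[, keep]
pheno.f <- pheno[keep, , drop = FALSE]
stopifnot(identical(colnames(expr.f), rownames(pheno.f)))
```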
eset_raw
## ExpressionFeatureSet (storageMode: lockedEnvironment)
## assayData: 300304 features, 12 samples
## element names: exprs
## protocolData
## rowNames: CTL.1 CTL.2 ... CMP.4 (12 total)
## varLabels: exprs dates
## varMetadata: labelDescription channel
## phenoData
## rowNames: CTL.1 CTL.2 ... CMP.4 (12 total)
## varLabels: FileName Group ... Batch (5 total)
## varMetadata: labelDescription channel
## featureData: none
## experimentData: use 'experimentData(object)'
## Annotation: pd.clariom.s.mouse.ht
In this example study, the expression matrix of raw data contains 300304 rows (features) and 12 columns (samples).
From now on, we will work with the ExpressionSet object eset_raw. This means that the phenotypic, annotation and expression data will be accessed from that object.
This step performs different exploratory analyses (boxplot, PCA, heatmap of sample distances and hierarchical clustering) to graphically inspect the quality of the samples from the raw data. In addition, a QC report is generated using the arrayQualityMetrics package.
Creates a boxplot of log2-transformed intensity values from raw data and saves it as pdf into the results directory.
if (boxplotRawData){
qc_boxplot(data=eset_raw, group="Group", group.color="Colors", samplenames="ShortName", outputDir=resultsDir, label="RawData")
}
This step performs a principal component analysis of the raw data and creates
different PCA plots with samples colored by each factor variable
specified in the 'pcaRawData.fact' parameter. The PCA plot can be
drawn in 2 and/or 3 dimensions (representing the first 2 or 3
principal components) by setting the dim parameter. The cumulative percentage of variance explained by each principal component is returned.
if (pcaRawData){
loadsPCAraw <- qc_pca1(data=exprs(eset_raw), scale=pcaRawData.scale, pca2D_factors=pcaRawData.fact, targets=targets, col.group="Colors", colorlist=colorlist, names=targets$ShortName, outputDir=resultsDir, label="RawData")
}
# for a 3D version:
if (pcaRawData){
loadsPCAraw <- qc_pca1(data=exprs(eset_raw), pca3D=TRUE, scale=pcaRawData.scale, dim=c(1,2,3), pca3D_factors="Group", targets=targets, col.group="Colors", colorlist=colorlist, names=targets$ShortName, outputDir=resultsDir, label="RawData")
}
if (heatmapRawData){
qc_hc(data=exprs(eset_raw), hclust.method=hclustRawData.method, names=targets$ShortName, cexRow = 0.6, cexCol = 0.6, rect=TRUE, numclusters=2, outputDir=resultsDir, label="RawData")
}
## png
## 2
#Note: Set 'intgroup' according to factors to be colored in heatmap. Other plots will be colored according to only the first factor
#required for Affymetrix QC
if (arrayQMRawData) {
require(arrayQualityMetrics)
arrayQualityMetrics(eset_raw, outdir = file.path(resultsDir, "QCDir.Raw"), force=TRUE, intgroup=targets.fact)
}
All the QC plots above can be generated in one step with the qc_all function:
qc_all(data=eset_raw, group="Group", group.color="Colors", samplenames="ShortName", factors=pcaRawData.fact, pca_scale=pcaRawData.scale, colorlist=colorlist, hc.method=hclustRawData.method, label="RawData", outputDir=resultsDir, summaryFN=resultsSummFN, doboxplot=TRUE, dopca=TRUE, dopvca=FALSE, dohc=TRUE, doarrayQMreport=FALSE)
Data can be normalized using the normalization() function, or directly loaded from a previously generated Rda object (normData.Rda). The normalization()
function normalizes raw data using one of the available methods
(currently: the rma method from the oligo package). The RMA method performs
background subtraction, quantile normalization and summarization (via
median polish) (ref). It returns a FeatureSet of normalized
data and saves the normalized expression values in a datasheet (csv
and/or xls file format) in the results folder.
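As an aside, the quantile-normalization step of RMA can be sketched in base R on a toy matrix (this is an illustration only; oligo's rma() additionally performs background correction and median-polish summarization on the full probe-level data):

```r
# Base-R sketch of quantile normalization: each column's values are
# replaced by the mean of the column-wise sorted values at that rank,
# so all columns end up with the same empirical distribution.
x <- matrix(c(5, 2, 3, 4,
              4, 1, 4, 2,
              3, 4, 6, 8), nrow = 4)
ranks        <- apply(x, 2, rank, ties.method = "first")
sorted_means <- rowMeans(apply(x, 2, sort))
x_qn         <- apply(ranks, 2, function(r) sorted_means[r])

# Every column now has an identical sorted profile:
stopifnot(all(apply(x_qn, 2, sort) == sorted_means))
```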
if (loadNormData) {
load(file.path(dataDir, "normData.Rda"))
} else if (normalizedata) {
eset_norm <- normalization(data=eset_raw, norm.method="rma", annotPkg=annotPackage, outputFN="Normalized.All", outputDir=resultsDir)
}
## Background correcting
## Normalizing
## Calculating Expression
This step constructs an aafTable object for a given set of probe ids using the aafTableAnn function from the annaffy package. The annotation dataframe is returned as an aafTable
object, and the annotations for all summarized probes are also saved as
csv and html files. Note: in exon studies, where probes are summarized
at the probeset level but not at the transcript level, the probe annotations
cannot be recovered with this function.
# Not sure why this gives an error; it only happens when executed as a function without having previously loaded the packages (?)
require(annotPackage, character.only=TRUE)
require(annaffy)
if (save_annot_all){normData_annot <- save_annotations(data=rownames(eset_norm), annotPkg=annotPackage, outputFN="Annotations.AllGenes", saveHTML=TRUE, title="Annotations for all genes", outputDir=resultsDir)}
Here, the same functions used for the quality control of the raw data can be used to assess the quality of the data after normalization. As an extra analysis, a PVCA (principal variance component analysis) can also be performed to further assess the sources of batch effects in the normalized data.
Creates a boxplot of the normalized data (already in log2 scale) and saves it as pdf into the results directory.
if (boxplotNormData){
qc_boxplot(data=eset_norm, group="Group", col="Colors", names="ShortName", outputDir=resultsDir, label="NormData")
}
## Warning in .local(x, ...): Argument 'which' ignored (not meaningful for
## ExpressionSet)
## Warning in .local(x, ...): Argument 'which' ignored (not meaningful for
## ExpressionSet)
This step performs a principal component analysis of the normalized data and creates different PCA plots with samples colored by each factor variable specified in the 'pcaNormData.fact' parameter. The cumulative percentage of variance explained by each principal component is returned.
if (pcaNormData){
loadsPCAnorm <- qc_pca1(data=exprs(eset_norm), scale=pcaNormData.scale, pca2D_factors=pcaNormData.fact, targets=targets, col.group="Colors", colorlist=colorlist, names=targets$ShortName, outputDir=resultsDir, label="NormData")
}
If there are batch effects, the PCA can be repeated removing the batch effects using the removeBatchEffect function from limma package. Here this step is performed only for illustration purposes, since the Batch variable is artificial. We use the same function as for the PCA but specify batchRemove=TRUE and the name of Batch variable in batchFactors parameter.
if (pcaNormData.corrbatch){
loadsPCAnorm.corrbatch <- qc_pca1(exprs(eset_norm), scale=pcaNormData.scale, pca2D_factors=pcaNormData.fact, targets=targets, col.group="Colors", colorlist=colorlist, names=targets$ShortName, outputDir=resultsDir, label="NormData", batchRemove=TRUE, batchFactors="Batch", size = 1.5, glineas = 0.25)
}
if (heatmapNormData){
qc_hc(data=exprs(eset_norm), hclust.method=hclustNormData.method, names=targets$ShortName, cexRow = 0.6, cexCol = 0.6, rect=TRUE, numclusters=2, outputDir=resultsDir, label="NormData")
}
## png
## 2
#Note: Set 'intgroup' according to factors to be colored in heatmap. Other plots will be colored according to only the first factor
require(arrayQualityMetrics) #required for Affymetrix QC
if (arrayQMNormData) arrayQualityMetrics(eset_norm, outdir = file.path(resultsDir, "QCDir.Norm"), force=TRUE, intgroup=targets.fact)
The PVCA approach can be used as a screening tool to determine which sources of variability (biological, technical or other) are most prominent in a given microarray data set.
if (pvcaNormData){
qc_pvca(data=eset_norm, factors=pvcaNormData.fact, targetsPVCA=targets.pvcaFN, pct_threshold=pct_threshold, label="NormData", outputDir=resultsDir, summaryFN=resultsSummFN)
}
## $dat
## [,1] [,2] [,3] [,4]
## [1,] 0.04757598 0.2911197 0.01818642 0.6431179
##
## $label
## [1] "Group:Batch" "Group" "Batch" "resid"
All the QC plots above can be generated in one step with the qc_all function:
qc_all(data=eset_norm, group="Group", group.color="Colors", samplenames="ShortName", factors=pcaNormData.fact, factorspvca=pvcaNormData.fact, pca_scale=pcaNormData.scale, colorlist=colorlist, hc.method=hclustNormData.method, label="NormData", outputDir=resultsDir, summaryFN=resultsSummFN, doboxplot=TRUE, dopca=TRUE, dopvca=TRUE, dohc=TRUE, doarrayQMreport=FALSE)
## boundary (singular) fit: see ?isSingular
## boundary (singular) fit: see ?isSingular
## boundary (singular) fit: see ?isSingular
## boundary (singular) fit: see ?isSingular
## boundary (singular) fit: see ?isSingular
## boundary (singular) fit: see ?isSingular
## $dat
## [,1] [,2] [,3] [,4]
## [1,] 0.04757598 0.2911197 0.01818642 0.6431179
##
## $label
## [1] "Group:Batch" "Group" "Batch" "resid"
The Quality Control may reveal some outlier samples that should be
removed before proceeding with the differential expression analysis.
Here, sample CMP.2 appears as a putative outlier both in
the hierarchical clustering and PCA analysis. Therefore, this sample
will be discarded from the analysis.
To remove samples, the whole ExpressionSet object can be subsetted as follows:
samplestoremove <- "CMP.2"
eset_raw.f <- eset_raw[,-which(colnames(eset_raw)%in%samplestoremove)]
dim(eset_raw)
## Features Samples
## 300304 12
dim(eset_raw.f)
## Features Samples
## 300304 11
Note that the phenoData contained in that object has also been subsetted:
pData(eset_raw.f)
## FileName Group ShortName Colors Batch
## CTL.1 GSM3730337_A1-LT_1Wk.CEL CTL CTL.1 aquamarine3 1
## CTL.2 GSM3730338_A2-LT_1Wk.CEL CTL CTL.2 aquamarine3 2
## CTL.3 GSM3730339_A3-LT_1Wk.CEL CTL CTL.3 aquamarine3 1
## CTL.4 GSM3730340_A4-LT_1Wk.CEL CTL CTL.4 aquamarine3 2
## PD1.1 GSM3730341_B1-LT_1Wk.CEL PD1 PD1.1 firebrick 1
## PD1.2 GSM3730342_B2-LT_1Wk.CEL PD1 PD1.2 firebrick 2
## PD1.3 GSM3730343_B3-LT_1Wk.CEL PD1 PD1.3 firebrick 1
## PD1.4 GSM3730344_B4-LT_1Wk.CEL PD1 PD1.4 firebrick 2
## CMP.1 GSM3730345_E1-LT_1Wk.CEL CMP CMP.1 chartreuse3 1
## CMP.3 GSM3730347_E4-LT_1Wk.CEL CMP CMP.3 chartreuse3 1
## CMP.4 GSM3730348_E5-LT_1Wk.CEL CMP CMP.4 chartreuse3 2
targets.f <- pData(eset_raw.f)
After sample removal, the whole QC and normalization process should be repeated without those samples.
Hence, data is normalized with the new dataset (here we don't change the outputFN parameter so that files Normalized.All.xls and Normalized.All.csv are overwritten):
eset_norm.f <- normalization(data=eset_raw.f, norm.method="rma", annotPkg=annotPackage, outputFN="Normalized.All", outputDir=resultsDir)
## Background correcting
## Normalizing
## Calculating Expression
The QC is performed again on the normalized data after sample
exclusion. We indicate that the plots correspond to the dataset
after sample exclusion by appending .f to the label parameter:
qc_all(data=eset_norm.f, group="Group", group.color="Colors", samplenames="ShortName", factors=pcaNormData.fact, factorspvca=pvcaNormData.fact, pca_scale=pcaNormData.scale, colorlist=colorlist, hc.method=hclustNormData.method, label="NormData.f", outputDir=resultsDir, summaryFN=resultsSummFN, doboxplot=TRUE, dopca=TRUE, dopvca=FALSE, dohc=TRUE, doarrayQMreport=FALSE)
If there were batch effects, we should also repeat the PCA removing batch effects with the new dataset:
loadsPCAnorm.f.corrbatch <- qc_pca1(exprs(eset_norm.f), scale=pcaNormData.scale, pca2D_factors=pcaNormData.fact, targets=targets.f, col.group="Colors", colorlist=colorlist, names=targets.f$ShortName, outputDir=resultsDir, label="NormData.f", batchRemove=TRUE, batchFactors="Batch", size = 1.5, glineas = 0.25)
Some filtering can be performed prior to the differential expression
analysis to remove the Affymetrix control probes and the probesets
with missing or duplicated Entrez IDs (annotation-based
filtering), and/or to remove probesets with low variance (variance-based
filtering). These options are controlled by the filterByAnnot and filterByVar parameters, respectively. The resulting object is named eset_filtered.
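The variance-based filter with filterByQuantile=TRUE works by computing a spread statistic per probeset (here IQR, as in filterByVar.fun) and keeping the probesets above the quantile of its distribution given by filterByVar.thr; a base-R sketch on simulated data:

```r
# Base-R sketch (simulated data) of variance-based filtering with
# filterByQuantile=TRUE: keep probesets whose IQR exceeds the 65th
# percentile of all probeset IQRs (filterByVar.thr = 0.65).
set.seed(123)
expr <- matrix(rnorm(1000 * 6), nrow = 1000,
               dimnames = list(paste0("ps_", 1:1000), paste0("S", 1:6)))

iqr_per_probe <- apply(expr, 1, IQR)          # spread statistic per probeset
cutoff        <- quantile(iqr_per_probe, 0.65) # quantile-based threshold
expr_filt     <- expr[iqr_per_probe > cutoff, ]

# About 35% of the probesets remain after filtering
nrow(expr_filt) / nrow(expr)
```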
Function sdplot() can be used to plot the standard deviations of all probesets in the array.
if (SDplot){
sdplot(data=exprs(eset_norm.f), var_thr=filterByVar.thr, label="NormData", outputDir=resultsDir)
}
## png
## 2
eset_filtered <- filtering(data=eset_norm.f, outputFN="Normalized.Filtered", outputDir=resultsDir, summaryFN=resultsSummFN,
feature.exclude="^AFFX",require.entrez=filterByAnnot, remove.dupEntrez = filterByAnnot,
var.filter=filterByVar, var.func = filterByVar.fun, var.cutoff = filterByVar.thr, filterByQuantile=TRUE,
require.GOBP = FALSE, require.GOCC = FALSE, require.GOMF = FALSE)
##
##
dim(eset_filtered)
## Features Samples
## 20296 11
# Not sure why this gives an error (Error: no such table: go.go_term); it only happens when executed as a function without having previously loaded the packages (?)
require(annotPackage, character.only=TRUE)
require(annaffy)
## Loading required package: annaffy
## Loading required package: GO.db
## Loading required package: KEGG.db
##
## KEGG.db contains mappings based on older data because the original
## resource was removed from the the public domain before the most
## recent update was produced. This package should now be considered
## deprecated and future versions of Bioconductor may not have it
## available. Users who want more current data are encouraged to look
## at the KEGGREST or reactome.db packages
if (save_annot_filt){normData.filt_annot <- save_annotations(data=rownames(eset_filtered), annotPkg=annotPackage, outputFN="Annotations.Filtered", saveHTML=TRUE, title="Annotations for filtered genes", outputDir=resultsDir)}
# When you save your data, use save(MyObject, file = "MyObject.RData", version = 2) to maintain backward compatibility and avoid the warning (otherwise, users running R < 3.5.0 will not be able to load your saved files).
if (!is.null(samplestoremove)){
if (normalizedata) {
save(eset_raw, eset_norm, eset_raw.f, eset_norm.f, eset_filtered, targets, targets.f, file=file.path(dataDir,"normData.Rda"), version=2)
} else {
save(eset_raw, eset_raw.f, targets, targets.f, file=file.path(dataDir,"rawData.Rda"), version=2)
}
} else {
if (normalizedata) {
save(eset_raw, eset_norm, eset_filtered, targets, file=file.path(dataDir,"normData.Rda"), version=2)
} else {
save(eset_raw, targets, file=file.path(dataDir,"rawData.Rda"), version=2)
}
}
The analysis to select differentially expressed genes will be based on adjusting a linear model with empirical Bayes moderation of the variance. This is a technique specifically developed for microarray data analysis by Gordon K Smyth @smyth2004. Linear modeling is the same as ordinary analysis of variance or multiple regression except that a model is fitted for every gene. Of note, limma estimates the gene-specific variance using information from all samples in all groups, even if not all groups participate in the pairwise comparisons. This is necessary in 'omics studies with limited replication, and outweighs any supposed disadvantage of differences in variances between groups. For statistical analysis and assessing differential expression, limma uses an empirical Bayes method to moderate the standard errors of the estimated log-fold changes. This results in more stable inference and improved power, especially for experiments with small numbers of arrays.
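The variance moderation can be illustrated numerically (the values below are made up for illustration; in practice limma estimates the prior degrees of freedom d0 and the prior variance s0^2 from the full set of genes):

```r
# Numeric illustration (base R, made-up numbers) of limma-style empirical
# Bayes variance moderation (Smyth 2004): each gene's sample variance s2
# is shrunk toward a common prior variance s0^2, weighted by the residual
# degrees of freedom d and the prior degrees of freedom d0.
d0    <- 4       # prior degrees of freedom (limma estimates this from the data)
s0_sq <- 0.05    # prior variance (also estimated from the data in practice)
d     <- 3       # residual df per gene, e.g. in a small experiment
s2    <- c(0.001, 0.05, 0.40)  # raw gene-wise variances: low, typical, high

# Posterior (moderated) variance used in the denominator of the moderated t
s2_post <- (d0 * s0_sq + d * s2) / (d0 + d)
round(s2_post, 4)

# Extreme variances are pulled toward the prior, stabilizing inference:
stopifnot(s2_post[1] > s2[1],   # very small variance is inflated
          s2_post[3] < s2[3])   # very large variance is deflated
```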
The main purpose of fitting a linear model to the data is to estimate the variability in the data, hence the systematic part needs to be modelled so it can be distinguished from random variation. The model is specified by the design matrix. Each row of the design matrix corresponds to an array in your experiment and each column corresponds to a coefficient that is used to describe the RNA sources in your experiment. Then, a contrasts step is performed so that you can take the initial coefficients and compare them in as many ways as you want to answer any questions you might have, regardless of how many or how few these might be. (ref: limma user's guide)
General recommendations:
- If the Group variable already integrates the information of Sex, no additional Sex covariate should be added to the model. Adding correlated variables may cause the design matrix to not be of full rank.
- The Batch variable should be added to the model as a systematic blocking factor.
References on how to build design matrices:
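The rank problem caused by correlated covariates can be demonstrated in base R with a hypothetical targets table where Sex is perfectly confounded with Group:

```r
# Base-R illustration (hypothetical samples) of why adding a covariate that
# is fully confounded with Group breaks the design: Sex is determined by
# Group, so the design matrix is rank-deficient and the model cannot be fit.
targets_demo <- data.frame(
  Group = factor(c("CTL", "CTL", "TRT", "TRT")),
  Sex   = factor(c("M", "M", "F", "F")),   # perfectly confounded with Group
  Batch = factor(c("1", "2", "1", "2")))   # crossed with Group: fine to add

X_bad  <- model.matrix(~ 0 + Group + Sex,   data = targets_demo)
X_good <- model.matrix(~ 0 + Group + Batch, data = targets_demo)

qr(X_bad)$rank  < ncol(X_bad)    # TRUE: rank-deficient design
qr(X_good)$rank == ncol(X_good)  # TRUE: full-rank design
```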
#Parameters
source(file.path(paramDir, "02-DEA.par"))
#Execution parameters
source(file.path(paramDir, "02-DEA_analysistodo.par"))
For this example study, parameters were set as follows:
(loadDEA_parameters <- read.table(file.path(paramDir, "02-DEA.par"), header = FALSE, sep=";", as.is=TRUE))
## V1
## 1 dataDirN <- dades
## 2 resultsDirN <- results
## 3 resultsSummFN <- ResultsSummary_DEA.txt
## 4 inputDEARda <- normData.Rda
## 5 eset_analysis <- eset_filtered
## 6 annotPackage <- clariomsmousetranscriptcluster.db
## 7 design.mainfact <- Group
## 8 design.cofact <- Batch
## 9 blocking.fact <- NULL
## 10 design.num <- NULL
## 11 design.num.targ <- NULL
## 12 contrastsv <- c(PD1vsCTL = PD1 - CTL,
## 13 CMPvsCTL = CMP - CTL,
## 14 PD1vsCMP = PD1 - CMP)
## 15 ncomp <- length(contrastsv)
## 16 toptable_padjust_method <- c(none, BH, BH)
## 17 volc_logFC <- c(0.5, 1, 1)
## 18 volc_pval <- c(P.Value, adj.P.Val, adj.P.Val)
## 19 volc_pval.thr <- rep(0.05, 3)
## 20 volc_x0 <- rep(-3, ncomp)
## 21 volc_x1 <- rep(+3, ncomp)
## 22 volc_y0 <- rep(0, ncomp)
## 23 volc_y1 <- rep(5, ncomp)
The execution parameters were set as follows:
(loadDEA_todoparameters <- read.table(file.path(paramDir, "02-DEA_analysistodo.par"), header = FALSE, sep=";", as.is=TRUE))
## V1
## 1 ReportHTMLTopTab <- TRUE
## 2 Volcanos <- TRUE
Since directories could change from block to block, we set them again for this block:
dataDir <- file.path(workingDir, dataDirN)
resultsDir <- file.path(workingDir, resultsDirN)
The Rda file generated in Block 01-Load-QC-Norm-Filt will be used as input for DEA:
load(file.path(dataDir, inputDEARda))
We assign the ExpressionSet object to be used for the analysis to an R object (it is specified in the eset_analysis parameter as a string)
eset_analysis <- eval(parse(text=eset_analysis))
The design matrix is created from the phenotypic data, which can be retrieved from the ExpressionSet object with pData(eset_analysis) or from the targets object. The safer option is to work with the ExpressionSet
object, instead of using the expression data and the targets file
separately. Note that the design matrix is constructed based only on
the sample names in targets, and the fit is then performed based on the indexed
positions.
Function dea_lmdesign() can be used to construct a design matrix. The main factor representing the Group of treatment should be provided in group parameter, and the covariates or systematic effects to take into account will be introduced in the covariates parameter. The design matrix will be constructed according to the formula ~ 0 + group + covariates.
This design is the most flexible, enabling the different
comparisons specified by the contrasts matrix. However, alternative
designs can also be created by providing a formula to the fmla parameter.
In addition, if random effects are to be taken into account, this will be specified in blocking.fact parameter and taken into account via duplicateCorrelation() function during the model fit (see next section).
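Under the hood, a formula of the form ~ 0 + group + covariates corresponds to base R's model.matrix(). The sketch below uses a toy targets table (hypothetical sample names, not this study's data) to show the kind of matrix dea_lmdesign() produces:

```r
# Toy targets table (hypothetical; mimics the structure of pData(eset_analysis))
targets_toy <- data.frame(
  Group = factor(c("CTL", "CTL", "PD1", "PD1"), levels = c("CTL", "PD1")),
  Batch = factor(c("1", "2", "1", "2")),
  row.names = c("CTL.1", "CTL.2", "PD1.1", "PD1.2")
)
# ~ 0 + Group + Batch: one indicator column per group, plus covariate columns
design_toy <- model.matrix(~ 0 + Group + Batch, data = targets_toy)
colnames(design_toy) <- sub("^Group", "", colnames(design_toy))
design_toy
```

Each sample gets a 1 in exactly one group column; the Batch2 column absorbs the batch effect, as in the design shown below.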
In this example, the artificially created Batch factor will be added to the model for illustration purposes only.
#parameters specified in file 02-DEA.par:
design.mainfact
## [1] "Group"
design.cofact
## [1] "Batch"
blocking.fact
## NULL
(design <- dea_lmdesign(targets=pData(eset_analysis), sampleNames="ShortName", data=eset_analysis,
group=design.mainfact, cov.fact=design.cofact, cov.num=design.num, fmla=NULL,
summaryFN=resultsSummFN, outputDir=resultsDir))
## CTL PD1 CMP Batch2
## CTL.1 1 0 0 0
## CTL.2 1 0 0 1
## CTL.3 1 0 0 0
## CTL.4 1 0 0 1
## PD1.1 0 1 0 0
## PD1.2 0 1 0 1
## PD1.3 0 1 0 0
## PD1.4 0 1 0 1
## CMP.1 0 0 1 0
## CMP.3 0 0 1 0
## CMP.4 0 0 1 1
## attr(,"assign")
## [1] 1 1 1 2
## attr(,"contrasts")
## attr(,"contrasts")$Group
## [1] "contr.treatment"
##
## attr(,"contrasts")$Batch
## [1] "contr.treatment"
Function dea_lmfit() fits the linear model specified by the design matrix to the data, using the lmFit function from the limma package. If random effects are to be taken into account (i.e. the blocking.fact parameter is not NULL), they are specified in the block.cor parameter: the correlation between measurements made on the same block is then estimated with limma's duplicateCorrelation function and used in the model fit. The estimated consensus correlation can be retrieved from the returned fit object (fit$correlation).
fit <- dea_lmfit(data=eset_analysis, targets=pData(eset_analysis), design=design, block.cor=NULL, summaryFN=resultsSummFN, outputDir=resultsDir)
The contrasts to be computed are defined in the contrastsv parameter of the 02-DEA.par file; the contrast names are the coefficients to be extracted from the fit object. Function dea_compare() constructs the contrast matrix and computes estimated coefficients and standard errors for the given set of contrasts. In addition, it performs an empirical Bayes moderation of the standard errors. If no moderation of the standard errors is desired (not recommended), set moderated=FALSE. The function returns the object fit.main.
fit.main <- dea_compare(fit=fit, contrasts=contrastsv, design=design, moderated=TRUE, summaryFN=resultsSummFN, outputDir=resultsDir)
names(fit.main)
## [1] "coefficients" "rank" "assign" "qr"
## [5] "df.residual" "sigma" "cov.coefficients" "stdev.unscaled"
## [9] "Amean" "method" "design" "contrasts"
## [13] "df.prior" "s2.prior" "var.prior" "proportion"
## [17] "s2.post" "t" "df.total" "p.value"
## [21] "lods" "F" "F.p.value"
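Numerically, each contrast is just a linear combination of the fitted design coefficients. This base-R sketch (with made-up group means, not values from this study) shows what a contrast matrix like the one built from contrastsv computes:

```r
# Hypothetical fitted group means for one gene (coefficients of ~ 0 + Group)
coefs <- c(CTL = 5.0, PD1 = 6.2, CMP = 5.5)
# Contrast matrix: each column defines one comparison as weights on coefficients
cont <- cbind(PD1vsCTL = c(-1, 1, 0),
              CMPvsCTL = c(-1, 0, 1),
              PD1vsCMP = c(0, 1, -1))
rownames(cont) <- names(coefs)
# Estimated log fold change for each contrast
lfc <- drop(coefs %*% cont)
lfc  # PD1vsCTL = 1.2, CMPvsCTL = 0.5, PD1vsCMP = 0.7
```

dea_compare() performs this for every gene at once and additionally moderates the standard errors with empirical Bayes.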
Assign the names of the comparisons (contrasts) to the listofcoef variable:
(listofcoef <- colnames(fit.main))
## [1] "PD1vsCTL" "CMPvsCTL" "PD1vsCMP"
Function dea_toptab can be used to obtain the toptables for the comparisons analyzed. To create HTML tables, use html_report=TRUE (execution will take longer).
# version with the default HTML tables (only shows adj. p-value or p-value)
listofcsv <- dea_toptab(listofcoef=listofcoef, fit.main=fit.main, eset=eset_analysis, padjust.method="fdr",
html_report=ReportHTMLTopTab,
html_ntop=500, html_group="Group", html_padjust_method = toptable_padjust_method,
outputDir=resultsDir)
head(listofcsv[[1]])
## ProbesetID Gene.Symbol EntrezID logFC AveExpr
## TC0200003103.mm.2 TC0200003103.mm.2 Mastl 67121 -0.7743359 6.578679
## TC0500000877.mm.2 TC0500000877.mm.2 Stbd1 52331 0.7601741 2.600744
## TC0200003303.mm.2 TC0200003303.mm.2 Lcn2 16819 -2.0423248 5.720048
## TC1700000991.mm.2 TC1700000991.mm.2 Zfp959 224893 -0.7598567 6.173209
## TC1100001853.mm.2 TC1100001853.mm.2 Sox9 20682 0.6264528 7.191829
## TC0300000694.mm.2 TC0300000694.mm.2 Fcrl1 229499 0.7608916 4.444347
## t P.Value adj.P.Val B CTL.1
## TC0200003103.mm.2 -6.581913 3.475403e-05 0.50501 0.99628871 6.967894
## TC0500000877.mm.2 5.344704 2.147716e-04 0.50501 -0.01539208 1.991386
## TC0200003303.mm.2 -5.334615 2.181767e-04 0.50501 -0.02474273 6.888696
## TC1700000991.mm.2 -5.267342 2.423942e-04 0.50501 -0.08757404 6.569216
## TC1100001853.mm.2 5.258972 2.456002e-04 0.50501 -0.09544905 6.903068
## TC0300000694.mm.2 5.154835 2.894639e-04 0.50501 -0.19452520 4.339100
## CTL.2 CTL.3 CTL.4 PD1.1 PD1.2 PD1.3
## TC0200003103.mm.2 7.022167 7.054732 7.091516 6.087973 6.264468 6.286478
## TC0500000877.mm.2 1.965754 2.345005 2.278587 2.816463 2.955682 3.105219
## TC0200003303.mm.2 6.270977 5.658767 6.677155 4.459839 4.391253 3.440885
## TC1700000991.mm.2 6.696533 6.520853 6.525938 5.967611 5.956430 5.513731
## TC1100001853.mm.2 6.840660 6.837817 6.666565 7.452319 7.572522 7.359331
## TC0300000694.mm.2 4.023021 4.164470 3.923459 5.092097 4.900643 5.031838
## PD1.4 CMP.1 CMP.3 CMP.4
## TC0200003103.mm.2 6.400044 6.327643 6.289229 6.573320
## TC0500000877.mm.2 2.744064 2.703738 2.893336 2.808951
## TC0200003303.mm.2 5.034319 6.340505 5.728454 8.029680
## TC1700000991.mm.2 5.835340 6.136566 5.826947 6.356136
## TC1100001853.mm.2 7.369749 7.418443 7.327211 7.362429
## TC0300000694.mm.2 4.469038 4.475632 4.142661 4.325857
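The adj.P.Val column above comes from the "fdr" adjustment requested in padjust.method, which is the Benjamini-Hochberg (BH) procedure of base R's p.adjust(). A minimal sketch with toy p-values (not study data):

```r
# BH/FDR adjustment as used for the adj.P.Val column (toy p-values)
pvals <- c(0.001, 0.008, 0.039, 0.041, 0.27)
p_adj <- p.adjust(pvals, method = "fdr")   # "fdr" is an alias for "BH"
p_adj  # 0.005, 0.02, 0.05125, 0.05125, 0.27
```

Note that ties can appear in the adjusted values (as in the adj.P.Val column of the toptable above) because BH takes a running minimum over the ranked p-values.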
# version with the HTML tables modified to show all statistics. Not executed here because it takes very long. To be improved.
listofcsv <- dea_toptab1(listofcoef=listofcoef, fit.main=fit.main, eset=eset_analysis, padjust.method="fdr",
html_report=TRUE, #ReportHTMLTopTab,
html_ntop=500, html_group="Group",
outputDir=resultsDir)
names(listofcsv)
head(listofcsv[[1]])
Function dea_summary_ngc can be used to summarize the results of the differential expression analysis. It returns a table with the number of features that are differentially expressed under different statistical thresholds; features are catalogued as up- or down-regulated if their logFC is above or below the selected logFC threshold, respectively.
(numGenesChanged <- dea_summary_ngc(listofcoef=listofcoef, listofcsv=listofcsv, B_thr=c(0), Pval_thr=c(0.01,0.05,0.1), adjPval_thr=c(0.01,0.05,0.15,0.25), logFC_thr=0, outputDir=resultsDir))
## Stat PD1vsCTL CMPvsCTL PD1vsCMP
## 1 B0_Up 0 117 10
## 2 B0_Down 1 14 77
## 3 adj.P.Val0.01_Up 0 3 0
## 4 adj.P.Val0.01_Down 0 0 7
## 5 adj.P.Val0.05_Up 0 50 0
## 6 adj.P.Val0.05_Down 0 1 27
## 7 adj.P.Val0.15_Up 0 274 10
## 8 adj.P.Val0.15_Down 0 94 81
## 9 adj.P.Val0.25_Up 0 669 25
## 10 adj.P.Val0.25_Down 0 416 144
## 11 P.Value0.01_Up 183 583 147
## 12 P.Value0.01_Down 194 319 343
## 13 P.Value0.05_Up 820 1244 705
## 14 P.Value0.05_Down 952 1258 979
## 15 P.Value0.1_Up 1425 1815 1342
## 16 P.Value0.1_Down 1811 2190 1564
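The counts above reduce to simple threshold filters on each toptable. A base-R sketch on a toy toptable fragment (made-up values) illustrates the logic:

```r
# Toy toptable fragment (hypothetical values)
top <- data.frame(logFC = c(1.8, -0.4, -2.1, 0.9),
                  P.Value = c(0.003, 0.02, 0.04, 0.2),
                  adj.P.Val = c(0.04, 0.12, 0.18, 0.5))
# Up/down counts at adj.P.Val < 0.05 and |logFC| > 0, as in dea_summary_ngc
sig <- top$adj.P.Val < 0.05
ngc <- c(Up = sum(sig & top$logFC > 0), Down = sum(sig & top$logFC < 0))
ngc  # Up = 1, Down = 0
```

dea_summary_ngc repeats this filter over the grid of B, P.Value and adj.P.Val thresholds to build the summary table.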
For an extended table involving different logFC thresholds, use function dea_summary_ngc_ext instead. The output will be a list with the summary tables obtained for each comparison.
ngc_list <- dea_summary_ngc_ext(listofcoef=listofcoef, listofcsv=listofcsv, B_thr.e=c(0), Pval_thr.e=c(0.01,0.05,0.1), adjPval_thr.e=c(0.01,0.05,0.15,0.25), logFC_thr.e=c(0,0.5,1,1.5,2), outputDir=resultsDir)
names(ngc_list)
## [1] "PD1vsCTL" "CMPvsCTL" "PD1vsCMP"
head(ngc_list[["PD1vsCTL"]])
## stat_type stat_thr logFC_thr NumGenes NumGenesUp NumGenesDown
## 1 B 0.00 0.0 1 0 1
## 2 B 0.00 0.5 1 0 1
## 3 B 0.00 1.0 0 0 0
## 4 B 0.00 1.5 0 0 0
## 5 B 0.00 2.0 0 0 0
## 6 adj.P.Val 0.01 0.0 0 0 0
The two functions mentioned above are combined into a single function dea_summary_ngc1, whose extended parameter specifies whether the extended tables are to be generated. The output is a list of two elements: the first contains the simple numGenesChanged dataframe and the second a list with the extended tables, if generated.
numGenesChanged_all <- dea_summary_ngc1(listofcoef=listofcoef, listofcsv=listofcsv, B_thr=c(0), Pval_thr=c(0.01,0.05,0.1), adjPval_thr=c(0.01,0.05,0.15,0.25), logFC_thr=0, extended=TRUE, B_thr.e=c(0), Pval_thr.e=c(0.01,0.05,0.1), adjPval_thr.e=c(0.01,0.05,0.15,0.25), logFC_thr.e=c(0,0.5,1,1.5,2), outputDir=resultsDir)
names(numGenesChanged_all)
## [1] "numGenesChanged" "numGenesChanged_extended"
The results of the differential expression analysis obtained for each comparison can be visualized in a volcano plot using the dea_volcanoplot() function. A volcano plot is a type of scatterplot that shows statistical significance (p-value) versus magnitude of change (fold change), enabling quick visual identification of genes with large fold changes that are also statistically significant. The top significant genes are shown in purple according to the logFC and p-value thresholds specified in the parameters volc_logFC, volc_pval and volc_pval.thr, with defaults of absolute log fold change above 1 and adjusted p-value below 0.05. In addition, gene symbols are shown for the most significant genes, up to a maximum of 20. The plots are saved as PDF into the specified outputDir and shown in the graphical device.
dea_volcanoplot(listofcsv=listofcsv, listofcoef=listofcoef, volc_logFC=rep(1,length(listofcoef)), volc_pval=c(rep("adj.P.Val", length(listofcoef))), volc_pval.thr=rep(0.05, length(listofcoef)), volc_x0=rep(-3, length(listofcoef)), volc_x1=rep(+3, length(listofcoef)), volc_y0=rep(0, length(listofcoef)), volc_y1=rep(10, length(listofcoef)), n=6, cols=2, outputDir=resultsDir, label="")
## NULL
## NULL
## NULL
## NULL
## NULL
## NULL
## [[1]]
##
## [[2]]
##
## [[3]]
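The core of such a plot can be sketched with base R graphics from any toptable-like data frame (simulated values here; dea_volcanoplot() adds gene labels and PDF export on top of this idea):

```r
# Simulated results table (hypothetical values, for illustration only)
set.seed(123)
tt <- data.frame(logFC = rnorm(200, sd = 1.5),
                 adj.P.Val = runif(200))
# Thresholds analogous to volc_logFC = 1 and volc_pval.thr = 0.05
sig <- abs(tt$logFC) > 1 & tt$adj.P.Val < 0.05
plot(tt$logFC, -log10(tt$adj.P.Val),
     pch = 16, col = ifelse(sig, "purple", "grey"),
     xlab = "log2 fold change", ylab = "-log10 adjusted p-value")
abline(v = c(-1, 1), h = -log10(0.05), lty = 2)  # threshold guides
```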
All the steps performed in Block 02-DEA can be executed in a single call to the dea_all() function. The result is a list with objects corresponding to each step of the analysis:
dea.results <- dea_all(data=eset_analysis, targets=pData(eset_analysis), sampleNames="ShortName", group=design.mainfact, cov.fact=design.cofact, cov.num=design.num, fmla=NULL, block.cor=blocking.fact, contrastsv=contrastsv, moderated=TRUE, padjust.method="fdr", html_report=FALSE, html_ntop=500, html_group="Group", B_thr=c(0), Pval_thr=c(0.01,0.05,0.1), adjPval_thr=c(0.01,0.05,0.15,0.25), logFC_thr=0, extended=TRUE, B_thr.e=c(0), Pval_thr.e=c(0.01,0.05,0.1), adjPval_thr.e=c(0.01,0.05,0.15,0.25), logFC_thr.e=c(0,0.5,1,1.5,2), dovolcanoplot=TRUE, volc_logFC=rep(1,length(contrastsv)), volc_pval=c(rep("adj.P.Val", length(contrastsv))), volc_pval.thr=rep(0.05, length(contrastsv)), volc_x0=rep(-3, length(contrastsv)), volc_x1=rep(+3, length(contrastsv)), volc_y0=rep(0, length(contrastsv)), volc_y1=rep(10, length(contrastsv)), n=6, cols=2, label="", summaryFN=resultsSummFN, outputDir=resultsDir)
## NULL
## NULL
## NULL
## NULL
## NULL
## NULL
names(dea.results)
## [1] "fit" "fit.main" "listofcsv" "numGenesChanged"
dea.results$fit.main$design
## CTL PD1 CMP Batch2
## CTL.1 1 0 0 0
## CTL.2 1 0 0 1
## CTL.3 1 0 0 0
## CTL.4 1 0 0 1
## PD1.1 0 1 0 0
## PD1.2 0 1 0 1
## PD1.3 0 1 0 0
## PD1.4 0 1 0 1
## CMP.1 0 0 1 0
## CMP.3 0 0 1 0
## CMP.4 0 0 1 1
## attr(,"assign")
## [1] 1 1 1 2
## attr(,"contrasts")
## attr(,"contrasts")$Group
## [1] "contr.treatment"
##
## attr(,"contrasts")$Batch
## [1] "contr.treatment"
dea.results$fit.main$contrasts
## Contrasts
## Levels PD1vsCTL CMPvsCTL PD1vsCMP
## CTL -1 -1 0
## PD1 1 0 1
## CMP 0 1 -1
## Batch2 0 0 0
#extract objects for further analysis:
fit.main <- dea.results$fit.main
listofcsv <- dea.results$listofcsv
numGenesChanged_all <- dea.results$numGenesChanged
save(listofcoef, listofcsv, eset_norm, eset_analysis, targets, file=file.path(dataDir,"afterTopTabs.Rda"), version=2)
It is interesting to look for common patterns of regulation between different experimental conditions. To find the degree of overlap among the genes selected in different comparisons, a multiple-comparisons analysis is performed, and Venn diagrams and heatmaps are plotted for the different sets of comparisons.
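Conceptually, the overlap region of a Venn diagram is the set intersection of the significant-gene lists. A base-R sketch with hypothetical gene lists (symbols borrowed from the toptable above, for illustration only):

```r
# Hypothetical significant-gene lists for two comparisons
sig_PD1vsCTL <- c("Mastl", "Lcn2", "Sox9")
sig_CMPvsCTL <- c("Lcn2", "Sox9", "Stbd1", "Fcrl1")
common <- intersect(sig_PD1vsCTL, sig_CMPvsCTL)
common          # genes shared by both comparisons
length(common)  # size of the overlap region in the Venn diagram
```

mc_venn_upset() builds these lists from the toptables using the thresholds in the 03-MC.par parameters and saves the shared elements to tables.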
#Parameters
source(file.path(paramDir, "03-MC.par"))
#Execution parameters
source(file.path(paramDir, "03-MC_analysistodo.par"))
For this example study, parameters were set as follows:
(loadMC_parameters <- read.table(file.path(paramDir, "03-MC.par"), header = FALSE, sep=";", as.is=TRUE))
## V1
## 1 dataDirN <- dades
## 2 resultsDirN <- results
## 3 resultsSummFN <- ResultsSummary_MC.txt
## 4 inputMCRda <- afterTopTabs.Rda
## 5 venn_comparNames <- c(EffectTreatment)
## 6 venn_compar <- list(1:3)
## 7 venn_pval <- c(adj.P.Val)
## 8 venn_pval.thr <- c(0.05)
## 9 venn_logFC <- c(1)
## 10 venn_include <- list(abs)
## 11 venn_pos = list(c(0, 0, 180))
## 12
## 13 colFeat = Gene.Symbol
## 14 venn_FC_col = logFC
## 15 venn_FC.thr <- c(1)
## 16 hm_comparNames <- c(EffectTreatment, EffectPD1)
## 17 hm_compar <- list(1:3, 1)
## 18 hm_groupexclude <- list(c(), c(CMP))
## 19 hm_pval <- c(adj.P.Val, P.Value)
## 20 hm_pval.thr <- c(0.05, 0.05)
## 21 hm_logFC <- logFC
## 22 hm_logFC.thr <- c(1, 1)
## 23 hm_palette <- colorRampPalette(c(blue, white, red))(n = 199)
## 24 hm_clustCol <- TRUE
## 25 hm_clustCol.dist <- euclidian
## 26 hm_clustCol.method <- complete
## 27 hm_clustRow.cor <- TRUE
## 28 batcheffect <- FALSE
## 29 batchcolName <- Batch
## 30 batchcolName2 <- NULL
## 31 batchNumcolName <- NULL
## 32 featureCol <- Gene.Symbol
(loadMC_todoparameters <- read.table(file.path(paramDir, "03-MC_analysistodo.par"), header = FALSE, sep=";", as.is=TRUE))
## V1
## 1 Venn_plots <- TRUE
## 2 Venn_sharedElems_extended <- TRUE
## 3 Heatmap_plots <- TRUE
## 4 hm_plots_interactive <- TRUE
Since directories could change from block to block, we set them again for this block:
dataDir <- file.path(workingDir, dataDirN)
resultsDir <- file.path(workingDir, resultsDirN)
rdaDir <- dataDir
The Rda file generated in Block 02-Differential Expression Analysis (DEA) will be used as input for MC:
load(file.path(dataDir, inputMCRda))
Function mc_venn_upset() can be used to generate Venn diagrams and/or UpSet plots for the comparisons analyzed.
listsharedelems <- lapply(seq_along(venn_comparNames), function(v) {
setwd(resultsDir)
namescomp <- names(listofcsv)[venn_compar[[v]]]
listsharedelems <- mc_venn_upset(listofcsv=listofcsv, namescomp=namescomp, label = venn_comparNames[v], colFeat = colFeat[v],
colPval = venn_pval[v], pval = venn_pval.thr[v], colFC=venn_FC_col[v], FC=venn_FC.thr[v], include=venn_include[v], pltR = TRUE,
pltPdf = TRUE, pltPng=FALSE, venn = TRUE, eul = FALSE, saveTables = TRUE, upsetPlot=FALSE,
saveTables_extended=Venn_sharedElems_extended,
colors = rainbow(length(namescomp)), trans = 0.5,
cex1 = 1, rotation=0, position=venn_pos[[v]], cex2 = 1, resultsSummFN=resultsSummFN, outputDir=resultsDir)
setwd(workingDir)
return(listsharedelems)
})
names(listsharedelems) <- venn_comparNames
knitr::include_graphics(file.path(resultsDir, "VennDiagram.abs.EffectTreatment.adj.P.Val0.05.logFC1.pdf"))
Function mc_hm() can be used to generate heatmaps for the comparisons analyzed.
mc_hm(listofcsv=listofcsv, hm_dataset=listofcsv[[1]], targets=targets.f, featureCol=featureCol, hm_comparNames=hm_comparNames, hm_compar=hm_compar, hm_groupexclude=hm_groupexclude, hm_pval=hm_pval, hm_pval.thr=hm_pval.thr, hm_logFC=hm_logFC, hm_logFC.thr=hm_logFC.thr, hm_palette=hm_palette, hm_clustCol=hm_clustCol, hm_clustCol.dist=hm_clustCol.dist, hm_clustCol.method=hm_clustCol.method, hm_clustRow.cor=hm_clustRow.cor, batcheffect=batcheffect, batchcolName=batchcolName, batchcolName2=batchcolName2, batchNumcolName=batchNumcolName, hm_plots_interactive=TRUE, resultsSummFN=resultsSummFN, outputDir=resultsDir)
## $EffectTreatment
##
## $EffectPD1
The analysis of biological significance is based on gene set enrichment analysis (GSEA) @Subramanian2005 over two annotation databases: the Gene Ontology (GO) and the Reactome pathway knowledgebase @reactome:2018. The goal of this analysis is to apply one of the available statistical tests to determine whether a given gene set, usually a particular GO category or Reactome pathway, is over-represented in the list of selected genes (the sample) compared to a reference set (the population) from which it was selected. The reference set is usually taken to be all the genes analyzed in the study. In GSEA, all the genes analyzed in the study (filtered genes) are ranked by log fold change and used in the analysis. GSEA aggregates the per-gene statistics across genes within a gene set, therefore making it possible to detect situations where all genes in a predefined set change in a small but coordinated way @Subramanian2005.
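The ranked list that GSEA consumes can be sketched in base R: a named vector of the ranking metric (logFC, as set in ranking_metric), with one value per Entrez ID, sorted decreasingly. Toy values below; the maUEB functions build this vector internally from the toptables:

```r
# Toy toptable fragment (hypothetical Entrez IDs and logFC values)
top <- data.frame(EntrezID = c("67121", "52331", "16819", "67121"),
                  logFC    = c(-0.77, 0.76, -2.04, -0.70))
top <- top[!duplicated(top$EntrezID), ]                    # one row per gene
geneList <- sort(setNames(top$logFC, top$EntrezID),
                 decreasing = TRUE)                        # ranked metric
geneList  # named vector, most up-regulated genes first
```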
#Parameters
source(file.path(paramDir, "04-ABS-GSEA.par"))
#Execution parameters
source(file.path(paramDir, "04-ABS-GSEA_analysistodo.par"))
For this example study, parameters were set as follows:
(loadABSGSEA_parameters <- read.table(file.path(paramDir, "04-ABS-GSEA.par"), header = FALSE, sep=";", as.is=TRUE))
## V1
## 1 dataDirN <- dades
## 2 resultsDirN <- results
## 3 resultsSummFN <- ResultsSummary_ABS-GSEA.txt
## 4 inputABSGSEARda <- afterTopTabs.Rda
## 5 annotPackage <- clariomsmousetranscriptcluster.db
## 6 organism_annot <- org.Mm.eg.db
## 7 organism <- mouse
## 8 label=
## 9 resultsSummFN <- ResultsSummary_ABS-GSEA.txt
## 10 ranking_metric <- logFC
## 11 geneColname <- EntrezID
## 12 keyType <- ENTREZID
## 13 label <-
## 14 readable <- FALSE
## 15 saveGMT <- TRUE
## 16 saveTables <- TRUE
## 17 saveRda <- TRUE
## 18 GO_comparisons <- c(1:3)
## 19 GO_pvalcutoff <- list(rep(0.05, 3), rep(0.05, 3), rep(0.05, 3))
## 20 GO_pAdjustMethod <- list(rep(BH, 3), rep(BH, 3), rep(BH, 3))
## 21 GO_minSetSize <- 3
## 22 GO_maxSetSize <- 500
## 23 Reac_comparisons <- GO_comparisons
## 24 Reac_pvalcutoff <- rep(0.05, 3)
## 25 Reac_pAdjustMethod <- rep(BH, 3)
## 26 Reac_minSetSize <- 3
## 27 Reac_maxSetSize <- 500
(loadABSGSEA_todoparameters <- read.table(file.path(paramDir, "04-ABS-GSEA_analysistodo.par"), header = FALSE, sep=";", as.is=TRUE))
## V1
## 1 GOAnalysis <- TRUE
## 2 GOAnalysis.unfilt <- GOAnalysis
## 3 GOAnalysis.filt <- GOAnalysis
## 4 GOPlots <- TRUE
## 5 GO_categories <- list(BP,CC,MF)
## 6 ReactomeAnalysis <- TRUE
## 7 ReactomeAnalysis.unfilt <- ReactomeAnalysis
## 8 ReactomeAnalysis.filt <- ReactomeAnalysis
## 9 ReactomePlots <- TRUE
Since directories could change from block to block, we set them again for this block:
dataDir <- file.path(workingDir, dataDirN)
resultsDir <- file.path(workingDir, resultsDirN)
rdaDir <- dataDir
gmtDir <- resultsDir
The Rda file generated in Block 02-Differential Expression Analysis (DEA) will be used as input for GSEA:
load(file.path(dataDir, inputABSGSEARda))
Function abs_gsea_GOunfilt() can be used to perform GO analysis for all the comparisons and store the unfiltered results (no p-value threshold) in lists. The necessary inputs are the toptables generated in the DEA; as output, it returns lists of unfiltered GO results and a table with the number of terms found at different p-value thresholds for the categories analyzed.
namescomp <- names(listofcsv)[GO_comparisons]
GSEA_GOresults.unfilt <- abs_gsea_GOunfilt(listofTopNamed=listofcsv, namescomp=namescomp, annotPackage=annotPackage, organism_annot= organism_annot, ranking_metric=ranking_metric, geneColname=geneColname, keyType=keyType, GO_minSetSize=GO_minSetSize, GO_maxSetSize=GO_maxSetSize, resultsSummFN=resultsSummFN, saveRda=saveRda, saveGMT=saveGMT, label=label, outputDir=resultsDir, rdaDir=rdaDir, gmtDir=gmtDir)
Function abs_gsea_GOfiltplot() filters the GO results according to the p-value thresholds set previously in the parameters. It takes as input the lists of unfiltered GO results and the p-value thresholds, and returns dataframes of GO results filtered by p-value. The function also performs a GO significance analysis, producing plots of the most enriched GO terms.
GSEA_GOresults.filt <- abs_gsea_GOfiltplot(GSEAresGO=GSEA_GOresults.unfilt, GO_pvalcutoff=GO_pvalcutoff, GO_pAdjustMethod=GO_pAdjustMethod, saveTables=saveTables, GOPlots=GOPlots, label=label, resultsSummFN=resultsSummFN, outputDir=resultsDir)
Function abs_gsea_ReactomePAunfilt() can be used to perform Reactome analysis for all the comparisons and store the unfiltered results (no p-value threshold) in lists. The necessary inputs are the toptables generated in the DEA; as output, it returns lists of unfiltered Reactome results and a table with the number of terms found at different p-value thresholds.
namescomp <- names(listofcsv)[Reac_comparisons]
GSEA_ReactomePA.unfilt <- abs_gsea_ReactomePAunfilt(listofTopNamed=listofcsv, namescomp=namescomp, organism=organism, organism_annot=organism_annot, Reac_minSetSize=Reac_minSetSize, Reac_maxSetSize=Reac_maxSetSize, resultsSummFN=resultsSummFN, saveRda=saveRda, saveGMT=saveGMT, label=label, outputDir=resultsDir, rdaDir=rdaDir, gmtDir=gmtDir)
Function abs_gsea_ReactomePAfiltplot() filters the Reactome results according to the p-value thresholds set previously in the parameters. It takes as input the lists of unfiltered Reactome results and the p-value thresholds, and returns dataframes of Reactome results filtered by p-value. The function also performs a Reactome significance analysis, producing plots of the most enriched Reactome pathways.
abs_gsea_ReactomePAfiltplot(GSEAresReac=GSEA_ReactomePA.unfilt, Reac_pvalcutoff=Reac_pvalcutoff, Reac_pAdjustMethod=Reac_pAdjustMethod, saveTables=saveTables, ReacPlots=ReactomePlots, label=label, resultsSummFN=resultsSummFN, outputDir=resultsDir)
An over-representation analysis (ORA) is performed over the GO-BP and Reactome pathway databases to find the terms enriched in the lists of genes up- or down-regulated in common between the comparisons analyzed.
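Unlike GSEA, ORA takes as input only the genes passing the selection thresholds (here the parameter defaults: P.Value < 0.05 and |logFC| > 1), optionally split by sign of logFC as governed by the sign parameter. A base-R sketch with toy values:

```r
# Toy toptable fragment (hypothetical values)
top <- data.frame(EntrezID = c("67121", "52331", "16819", "20682"),
                  logFC    = c(-1.4, 0.3, -2.0, 1.6),
                  P.Value  = c(0.01, 0.004, 0.03, 0.02))
# Selection as in selected.pval.go / selected.pvalthr.go / selected.logFC.go
sel  <- top$P.Value < 0.05 & abs(top$logFC) > 1
up   <- top$EntrezID[sel & top$logFC > 0]   # genes for the up-regulated list
down <- top$EntrezID[sel & top$logFC < 0]   # genes for the down-regulated list
up; down
```

These gene lists are then tested for enrichment against the universe of all analyzed genes.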
#Parameters
source(file.path(paramDir, "04-ABS-ORA.par"))
#Execution parameters
source(file.path(paramDir, "04-ABS-ORA_analysistodo.par"))
For this example study, parameters were set as follows:
(loadABSORA_parameters <- read.table(file.path(paramDir, "04-ABS-ORA.par"), header = FALSE, sep=";", as.is=TRUE))
## V1
## 1 dataDirN <- dades
## 2 resultsDirN <- results
## 3 resultsSummFN <- ResultsSummary_ABS-ORA.txt
## 4 inputABSORARda <- afterTopTabs.Rda
## 5 annotPackage <- clariomsmousetranscriptcluster.db
## 6 organism_annot <- org.Mm.eg.db
## 7 organism <- mouse
## 8 label=
## 9 resultsSummFN <- ResultsSummary_ABS-ORA.txt
## 10 universe <- NULL
## 11 sign <- c(1,-1)
## 12 keyType <- ENTREZID
## 13 geneColname <- EntrezID
## 14 GO_categories <- list(BP)
## 15 GO_minSetSize <- 3
## 16 GO_maxSetSize <- 500
## 17 selected.pval.go <- c(P.Value,P.Value,P.Value)
## 18 selected.pvalthr.go <- c(0.05,0.05,0.05)
## 19 selected.logFC.go <- c(1,1,1)
## 20 col_logFC.go <- c(logFC)
## 21 sign_logFC.go <- c(abs)
## 22 GOPlots <- TRUE
## 23 GO_pvalcutoff <- rep(list(c(BP=0.15, CC=0.15, MF=0.15)),length(namescomp))
## 24 GO_pAdjustMethod <- rep(list(c(BP=BH, CC=BH, MF=BH)),length(namescomp))
## 25
## 26 col_entrez <- EntrezID
## 27 Reac_minSetSize <- 3
## 28 Reac_maxSetSize <- 500
## 29 Reac_pvalcutoff <- rep(0.15,length(namescomp))
## 30 Reac_pAdjustMethod <- rep(BH,length(namescomp))
## 31 ReacPlots <- TRUE
## 32 selected.pval.reac <- c(P.Value,P.Value,P.Value)
## 33 selected.pvalthr.reac <- c(0.05,0.05,0.05)
## 34 selected.logFC.reac <- c(1,1,1)
## 35 col_logFC.reac <- c(logFC)
## 36 sign_logFC.reac <- c(abs)
(loadABSORA_todoparameters <- read.table(file.path(paramDir, "04-ABS-ORA_analysistodo.par"), header = FALSE, sep=";", as.is=TRUE))
## V1
## 1 readable <- TRUE
## 2 saveGMT <- TRUE
## 3 saveTables <- TRUE
## 4 saveRda <- TRUE
## 5 GOAnalysis <- TRUE
## 6 GOAnalysis.unfilt <- GOAnalysis
## 7 GOAnalysis.filt <- GOAnalysis
## 8 GOPlots <- TRUE
## 9 GO_categories <- list(BP,CC,MF)
## 10 ReactomeAnalysis <- TRUE
## 11 ReactomeAnalysis.unfilt <- ReactomeAnalysis
## 12 ReactomeAnalysis.filt <- ReactomeAnalysis
## 13 ReactomePlots <- TRUE
Since directories could change from block to block, we set them again for this block:
dataDir <- file.path(workingDir, dataDirN)
resultsDir <- file.path(workingDir, resultsDirN)
rdaDir <- dataDir
gmtDir <- resultsDir
The Rda file generated in Block 02-Differential Expression Analysis (DEA) will be used as input for ABS-ORA:
load(file.path(dataDir, inputABSORARda))
Function abs_ora_GOunfilt() can be used to perform an over-representation GO analysis for all the comparisons and store the unfiltered results (no p-value threshold) in lists. The necessary inputs are the toptables generated in the DEA; as output, it returns lists of unfiltered GO results and a table with the number of terms found at different p-value thresholds for the categories analyzed.
ORA_GOresults.unfilt <- abs_ora_GOunfilt(listofTopNamed=listofcsv, namescomp=namescomp, universe=universe, geneColname=geneColname, readable=readable, selected.pval.go=selected.pval.go, selected.pvalthr.go=selected.pvalthr.go, selected.logFC.go=selected.logFC.go, col_logFC.go=col_logFC.go, sign_logFC.go=sign_logFC.go, organism_annot=organism_annot, keyType=keyType, GO_categories=GO_categories, GO_minSetSize=GO_minSetSize, GO_maxSetSize=GO_maxSetSize, resultsSummFN=resultsSummFN, saveRda=saveRda, saveGMT=saveGMT, label=label, outputDir=resultsDir, rdaDir=rdaDir, gmtDir=gmtDir)
Function abs_ora_GOfiltplot() filters the GO results according to the p-value thresholds set previously in the parameters. It takes as input the lists of unfiltered GO results and the p-value thresholds, and returns dataframes of GO results filtered by p-value. The function also performs a GO significance analysis, producing plots of the most enriched GO terms.
ORA_GOresults.filt <- abs_ora_GOfiltplot(ORAresGO=ORA_GOresults.unfilt, sign=sign, GO_pvalcutoff=GO_pvalcutoff, GO_pAdjustMethod=GO_pAdjustMethod, saveTables=saveTables, GOPlots=GOPlots, label=label, resultsSummFN=resultsSummFN, outputDir=resultsDir)
Function abs_ora_ReactomePAunfilt() can be used to perform an over-representation Reactome analysis for all the comparisons and store the unfiltered results (no p-value threshold) in lists. The necessary inputs are the toptables generated in the DEA; as output, it returns lists of unfiltered Reactome results and a table with the number of terms found at different p-value thresholds.
ORA_ReactomePAresults.unfilt <- abs_ora_ReactomePAunfilt(listofTopNamed=listofcsv, namescomp=namescomp, universe=universe, col_entrez=col_entrez, readable=readable, selected.pval.reac=selected.pval.reac, selected.pvalthr.reac=selected.pvalthr.reac, selected.logFC.reac=selected.logFC.reac, col_logFC.reac=col_logFC.reac, sign_logFC.reac=sign_logFC.reac, organism=organism, organism_annot=organism_annot, keyType=keyType, Reac_minSetSize=Reac_minSetSize, Reac_maxSetSize=Reac_maxSetSize, resultsSummFN=resultsSummFN, saveRda=saveRda, saveGMT=saveGMT, label=label, outputDir=resultsDir, rdaDir=rdaDir, gmtDir=gmtDir)
Function abs_ora_ReactomePAfiltplot() filters the Reactome results according to the p-value thresholds set previously in the parameters. It takes as input the lists of unfiltered Reactome results and the p-value thresholds, and returns dataframes of Reactome results filtered by p-value. The function also performs a Reactome significance analysis, producing plots of the most enriched Reactome pathways.
ORA_ReactomePAanalysis_filt <- abs_ora_ReactomePAfiltplot(ORAresReac=ORA_ReactomePAresults.unfilt, sign=sign, Reac_pvalcutoff=Reac_pvalcutoff, Reac_pAdjustMethod=Reac_pAdjustMethod, saveTables=saveTables, ReacPlots=ReacPlots, label=label, resultsSummFN=resultsSummFN, outputDir=resultsDir)